{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Calling the SHARE API\n",
    "----\n",
    "Here are some working examples of how to query the current scrAPI database for metrics of results coming through the SHARE Notifiation Service.\n",
    "\n",
    "These particular queries are just examples, and the data is open for anyone to use, so feel free to make your own and experiment!\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Service Names for Reference\n",
    "----\n",
    "Each provider harvested from uses a shortened name for its source. Let's make an API call to generate a table to get all of those short names, along with the official name of the repository that they represent.\n",
    "\n",
    "The SHARE API has different endpoints. One of those endpoints returns a list of all of the providers that SHARE is harvesting from, along with their short names, official names, links to their homepages, and a simple version of an icon representing their service, in a parsable format called json.\n",
    "\n",
    "Let's make a call to that API endpoint using the requests libarary, get the json data, and print out all of the shortnames and longnames."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "doepages - Department of Energy Pages\n",
      "scholarsphere - ScholarSphere @ Penn State University\n",
      "scholarsarchiveosu - ScholarsArchive@OSU\n",
      "calhoun - Calhoun: Institutional Archive of the Naval Postgraduate School\n",
      "scholarworks_umass - ScholarWorks@UMass Amherst\n",
      "cambridge - Apollo @ University of Cambridge\n",
      "texasstate - DSpace at Texas State University\n",
      "osf - Open Science Framework\n",
      "lwbin - Lake Winnipeg Basin Information Network\n",
      "uow - Research Online @ University of Wollongong\n",
      "oaktrust - The OAKTrust Digital Repository at Texas A&M\n",
      "umich - Deep Blue @ University of Michigan\n",
      "cogprints - Cognitive Sciences ePrint Archive\n",
      "utktrace - Trace: Tennessee Research and Creative Exchange\n",
      "stcloud - The repository at St Cloud State\n",
      "smithsonian - Smithsonian Digital Repository\n",
      "csuohio - Cleveland State University's EngagedScholarship@CSU\n",
      "biomedcentral - BioMed Central\n",
      "crossref - CrossRef\n",
      "purdue - PURR - Purdue University Research Repository\n",
      "unl_digitalcommons - DigitalCommons@University of Nebraska - Lincoln\n",
      "scholarscompass_vcu - VCU Scholars Compass\n",
      "dataone - DataONE: Data Observation Network for Earth\n",
      "cmu - Carnegie Mellon University Research Showcase\n",
      "nist - NIST MaterialsData\n",
      "udel - University of Delaware Institutional Repository\n",
      "u_south_fl - University of South Florida - Scholar Commons\n",
      "rcaap - RCAAP - Repositório Científico de Acesso Aberto de Portugal\n",
      "utaustin - University of Texas at Austin Digital Repository\n",
      "mason - Mason Archival Repository Service\n",
      "kent - Digital Commons @ Kent State University Libraries\n",
      "ut_chattanooga - University of Tennessee at Chattanooga\n",
      "asu - Arizona State University Digital Repository\n",
      "zenodo - Zenodo\n",
      "calpoly - Digital Commons @ CalPoly\n",
      "ghent - Ghent University Academic Bibliography\n",
      "neurovault - NeuroVault.org\n",
      "plos - Public Library of Science\n",
      "tdar - The Digital Archaeological Record\n",
      "vtech - Virginia Tech VTechWorks\n",
      "ncar - Earth System Grid at NCAR\n",
      "duke - Duke University Libraries\n",
      "scholarsbank - Scholars Bank University of Oregon\n",
      "datacite - DataCite MDS\n",
      "huskiecommons - Huskie Commons @ Northern Illinois University\n",
      "uky - UKnowledge @ University of Kentucky\n",
      "pcurio - Pontifical Catholic University of Rio de Janeiro\n",
      "trinity - Digital Commons @ Trinity University\n",
      "pdxscholar - PDXScholar Portland State University\n",
      "iastate - Digital Repository @ Iowa State University\n",
      "sldr - Speech and Language Data Repository (SLDR/ORTOLANG)\n",
      "cuscholar - CU Scholar University of Colorado Boulder\n",
      "pubmedcentral - PubMed Central\n",
      "cyberleninka - CyberLeninka - Russian open access scientific library\n",
      "mblwhoilibrary - WHOAS at MBLWHOI Library\n",
      "springer - Springer\n",
      "shareok - SHAREOK Repository\n",
      "dryad - Dryad Data Repository\n",
      "spdataverse - Scholars Portal dataverse\n",
      "noaa_nodc - National Oceanographic Data Center\n",
      "bhl - Biodiversity Heritage Library OAI Repository\n",
      "krex - K-State Research Exchange\n",
      "dash - Digital Access to Scholarship at Harvard\n",
      "lshtm - London School of Hygiene and Tropical Medicine Research Online\n",
      "waynestate - Digital Commons @ Wayne State\n",
      "opensiuc - OpenSIUC at the Southern Illinois University Carbondale\n",
      "hacettepe - Hacettepe University DSpace on LibLiveCD\n",
      "ucescholarship - eScholarship @ University of California\n",
      "icpsr - Inter-University Consortium for Political and Social Research\n",
      "iwu_commons - Digital Commons @ Illinois Wesleyan University\n",
      "npp_ksu - New Prairie Press at Kansas State University\n",
      "dailyssrn - Social Science Research Network\n",
      "caltech - CaltechAUTHORS\n",
      "wash_state_u - Washington State University Research Exchange\n",
      "harvarddataverse - Harvard Dataverse\n",
      "scitech - DoE's SciTech Connect Database\n",
      "mit - DSpace@MIT\n",
      "erudit - Érudit\n",
      "nih - NIH Research Portal Online Reporting Tools\n",
      "pcom - DigitalCommons@PCOM\n",
      "columbia - Columbia Academic Commons\n",
      "figshare - figshare\n",
      "citeseerx - CiteSeerX Scientific Literature Digital Library and Search Engine\n",
      "addis_ababa - Addis Ababa University Institutional Repository\n",
      "arxiv_oai - ArXiv\n",
      "digitalhoward - Digital Howard @ Howard University\n",
      "wustlopenscholarship - Washington University Open Scholarship\n",
      "clinicaltrials - ClinicalTrials.gov\n",
      "iowaresearch - Iowa Research Online\n",
      "valposcholar - Valparaiso University ValpoScholar\n",
      "chapman - Chapman University Digital Commons\n",
      "upennsylvania - University of Pennsylvania Scholarly Commons\n",
      "elife - eLife Sciences\n",
      "uwashington - ResearchWorks @ University of Washington\n",
      "uiucideals - University of Illinois at Urbana-Champaign, IDEALS\n"
     ]
    }
   ],
   "source": [
    "import requests\n",
    "\n",
    "data = requests.get('https://osf.io/api/v1/share/providers/').json()\n",
    "\n",
    "for source in data['providerMap'].keys():\n",
    "    print(\n",
    "        '{} - {}'.format(\n",
    "            data['providerMap'][source]['short_name'],\n",
    "            data['providerMap'][source]['long_name'].encode('utf-8')\n",
    "        )\n",
    "    )\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### SHARE Schema\n",
    "\n",
    "You can make queries against any of the fields defined in the [SHARE Schema](https://github.com/CenterForOpenScience/SHARE-Schema/blob/master/share.yaml). If we were able to harvest the information from the original source, it should appear in SHARE. However, not all fields are required for every document. \n",
    "\n",
    "Required fields include:\n",
    "- title\n",
    "- contributors\n",
    "- uris\n",
    "- providerUpdatedDateTime\n",
    "\n",
    "We add some information after each document is harvested inside the field shareProperties, including:\n",
    "- source (where the document was originally harvested)\n",
    "- docID  (a unique identifier for that object from that source)\n",
    "\n",
    "These two fields can be combined to make a unique document identifier."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Simple Queries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need a URL to use to access the SHARE API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "OSF_APP_URL = 'https://osf.io/api/v1/share/search/'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's get the first 3 results from the most basic query - the first page of the most recently updated research release events in SHARE.\n",
    "\n",
    "We'll use the URL parsing library furl to keep track of all of our arguments to the URL, because we'll be modifying them as we go along. We'll print the URL as we go to take a look at it, so we know what we're requesting.\n",
    "\n",
    "We'll print out the result's title, original source, and when it was updated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The request URL is https://osf.io/api/v1/share/search/?size=3&sort=providerUpdatedDateTime\n",
      "----------\n",
      "At the Crossroads: On Fairytales, Firebirds, and Real Life Choices -- from iwu_commons -- updated at 2016-01-26T16:47:20+00:00\n",
      "School psychology 2010: Demographics, employment, and the context for professional practices – Part 1 -- from u_south_fl -- updated at 2016-01-26T14:25:24+00:00\n",
      "Principles of Biology -- from npp_ksu -- updated at 2016-01-26T13:30:58+00:00\n"
     ]
    }
   ],
   "source": [
    "import furl\n",
    "\n",
    "search_url = furl.furl(OSF_APP_URL)\n",
    "search_url.args['size'] = 3\n",
    "search_url.args['sort'] = 'providerUpdatedDateTime'\n",
    "recent_results = requests.get(search_url.url).json()\n",
    "\n",
    "print('The request URL is {}'.format(search_url.url))\n",
    "print('----------')\n",
    "for result in recent_results['results']:\n",
    "    print(\n",
    "        '{} -- from {} -- updated at {}'.format(\n",
    "            result['title'].encode('utf-8'),\n",
    "            result['shareProperties']['source'],\n",
    "            result['providerUpdatedDateTime']\n",
    "        )\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's limit that query to only documents mentioning \"giraffes\" somewhere in the title, description, or in any of the metadata. We'd do that by adding a query search term."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The request URL is https://osf.io/api/v1/share/search/?size=3&sort=providerUpdatedDateTime&q=giraffes\n",
      "---------\n",
      "Odd creature was ancient ancestor of today’s giraffes -- from crossref -- updated at 2015-11-24T00:00:00+00:00\n",
      "Naturalized seeing/colonial vision : interrogating the display of races in late nineteenth century France -- from datacite -- updated at 2015-11-11T23:35:35+00:00\n",
      "Integration of complex shapes and natural patterns -- from datacite -- updated at 2015-11-07T03:06:45+00:00\n"
     ]
    }
   ],
   "source": [
    "search_url.args['q'] = 'giraffes'\n",
    "recent_results = requests.get(search_url.url).json()\n",
    "\n",
    "print('The request URL is {}'.format(search_url.url))\n",
    "print('---------')\n",
    "for result in recent_results['results']:\n",
    "    print(\n",
    "        '{} -- from {} -- updated at {}'.format(\n",
    "            result['title'].encode('utf-8'),\n",
    "            result['shareProperties']['source'],\n",
    "            result['providerUpdatedDateTime']\n",
    "        )\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's search for documents from the source mit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The request URL is https://osf.io/api/v1/share/search/?size=3&sort=providerUpdatedDateTime&q=shareProperties.source:mit\n",
      "---------\n",
      "Similitude: Interfacing a Traffic Simulator and Network Simulator with Emulated Android Clients -- from mit -- updated at 2016-01-25T18:35:04+00:00\n",
      "MobiStreams: A Reliable Distributed Stream Processing System for Mobile Devices -- from mit -- updated at 2016-01-25T18:30:59+00:00\n",
      "Approximate cone factorizations and lifts of polytopes -- from mit -- updated at 2016-01-25T18:25:44+00:00\n"
     ]
    }
   ],
   "source": [
    "search_url.args['q'] = 'shareProperties.source:mit'\n",
    "recent_results = requests.get(search_url.url).json()\n",
    "\n",
    "print('The request URL is {}'.format(search_url.url))\n",
    "print('---------')\n",
    "for result in recent_results['results']:\n",
    "    print(\n",
    "        '{} -- from {} -- updated at {}'.format(\n",
    "            result['title'].encode('utf-8'),\n",
    "            result['shareProperties']['source'],\n",
    "            result['providerUpdatedDateTime']\n",
    "        )\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's combine the two and find documents from MIT that mention giraffes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The request URL is https://osf.io/api/v1/share/search/?size=3&sort=providerUpdatedDateTime&q=shareProperties.source:mit+AND+giraffes\n",
      "---------\n",
      "Giraffes, religion and conflict : essays in behavioral decision making -- from mit -- updated at 2015-08-03T20:00:59\n"
     ]
    }
   ],
   "source": [
    "search_url.args['q'] = 'shareProperties.source:mit AND giraffes'\n",
    "recent_results = requests.get(search_url.url).json()\n",
    "\n",
    "print('The request URL is {}'.format(search_url.url))\n",
    "print('---------')\n",
    "for result in recent_results['results']:\n",
    "    print(\n",
    "        '{} -- from {} -- updated at {}'.format(\n",
    "            result['title'].encode('utf-8'),\n",
    "            result['shareProperties']['source'],\n",
    "            result['providerUpdatedDateTime']\n",
    "        )\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Complex Queries\n",
    "The SHARE Search API runs on elasticsearch - meaning that it can accept complicated queries that give you a wide variety of information.\n",
    "\n",
    "Here are some examples of how to make more complex queries using the raw elasticsearch results. You can read a [lot more about elasticsearch queries here](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'https://osf.io/api/v1/share/search/'"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "search_url.args = None  # reset the args so that we remove our old query arguments.\n",
    "search_url.url # Show the URL that we'll be requesting to make sure the args were cleared"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Query Setup\n",
    "\n",
    "We can define a few functions that we can reuse to make querying simpler. Elasticsearch queries are passed through as json blobs specifying how to return the information you want."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "def query_share(url, query):\n",
    "    # A helper function that will use the requests library,\n",
    "    # pass along the correct headers,\n",
    "    # and make the query we want\n",
    "    headers = {'Content-Type': 'application/json'}\n",
    "    data = json.dumps(query)\n",
    "    return requests.post(url, headers=headers, data=data, verify=False).json()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### Some Queries\n",
    "The SHARE schema has many spots for information, and many of the original sources do not provide this information. We can do a query to find out if a certain field exists or not within certain records. The SHARE API is set up to not display the field if it is empty.\n",
    "\n",
    "Let's query for the first 5 all documents that have a sponsorship field."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sponsorship_query = {\n",
    "    \"size\": 5,\n",
    "    \"query\": {\n",
    "        \"filtered\": {\n",
    "            \"filter\": {\n",
    "                \"exists\": {\n",
    "                    \"field\": \"sponsorships\"\n",
    "                }\n",
    "            }\n",
    "        }\n",
    "    }\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A Phase III, Randomized, Comparative, Open-label Study of Intravenous Iron Isomaltoside 1000 (Monofer®) Administered as Maintenance Therapy by Single or Repeated Bolus Injections in Comparison With Intravenous Iron Sucrose in Subjects With Stage 5 Chronic Kidney Disease on Dialysis Therapy (CKD-5D) -- from source clinicaltrials -- sponsored by Pharmacosmos A/S \n",
      "-------------------\n",
      "Phase IB Study of FOLFIRINOX Plus PF-04136309 in Patients With Borderline Resectable and Locally Advanced Pancreatic Adenocarcinoma -- from source clinicaltrials -- sponsored by Washington University School of Medicine National Cancer Institute (NCI)\n",
      "-------------------\n",
      "Discontinuation of Infliximab Therapy in Patients With Crohn's Disease During Sustained Complete Remission: A National Multi-center, Double Blinded, Randomized, Placebo Controlled Study -- from source clinicaltrials -- sponsored by Copenhagen University Hospital at Herlev \n",
      "-------------------\n",
      "Temperature Evaluation by MRI Thermometry During Cervical Cooling -- from source clinicaltrials -- sponsored by University of Vermont Cryothermic Systems. Inc.\n",
      "-------------------\n",
      "Combination of Lenalidomide and Ofatumumab in Patients With Previously Treated Chronic Lymphocytic Leukemia and Small Lymphocytic Lymphoma (CLL/SLL) -- from source clinicaltrials -- sponsored by M.D. Anderson Cancer Center GlaxoSmithKline\n",
      "-------------------\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/erin/.virtualenvs/tuts/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py:768: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html\n",
      "  InsecureRequestWarning)\n"
     ]
    }
   ],
   "source": [
    "results = query_share(search_url.url, sponsorship_query)\n",
    "\n",
    "for item in results['results']:\n",
    "#     print(item['sponsorships'])\n",
    "    print('{} -- from source {} -- sponsored by {}'.format(\n",
    "            item['title'].encode('utf-8'),\n",
    "            item['shareProperties']['source'].encode('utf-8'),\n",
    "            ' '.join(\n",
    "                [sponsor['sponsor']['sponsorName'] for sponsor in item['sponsorships']]\n",
    "            )\n",
    "        )\n",
    "    )\n",
    "    print('-------------------')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's see how many results do not have tags."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "tags_query = {\n",
    "    \"query\": {\n",
    "        \"query_string\": {\n",
    "            \"analyze_wildcard\": True, \n",
    "            \"query\": \"NOT tags:*\"\n",
    "        }\n",
    "    }\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4157872 results out of 4272328, or 97.32%, do not have tags.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/erin/.virtualenvs/tuts/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py:768: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html\n",
      "  InsecureRequestWarning)\n"
     ]
    }
   ],
   "source": [
    "results_with_tags = query_share(search_url.url, tags_query)\n",
    "total_results = requests.get(search_url.url).json()['count']\n",
    "results_percent = (float(results_with_tags['count'])/total_results)*100\n",
    "\n",
    "print(\n",
    "    '{} results out of {}, or {}%, do not have tags.'.format(\n",
    "        results_with_tags['count'],\n",
    "        total_results,\n",
    "        format(results_percent, '.2f')\n",
    "    )\n",
    ")\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using SHAREPA for SHARE Parsing and Analysis\n",
    "\n",
    "While you can always pass raw elasticsearch queries to the SHARE API, there is also a pip-installable python library that you can use that makes elasticsearch aggregations a little simpler. This library is called [sharepa - short for SHARE Parsing and Analysis](https://github.com/CenterForOpenScience/sharepa#sharepa)\n",
    "\n",
    "### Basic Actions\n",
    "\n",
    "A basic search will provide access to all documents in SHARE in 10 document slices.\n",
    "\n",
    "#### Count\n",
    "You can use sharepa and the basic search to get the total number of documents in SHARE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4272328"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sharepa import basic_search\n",
    "\n",
    "basic_search.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Iterating Through Results\n",
    "Executing the basic search will send the actual basic query to the SHARE API and then let you iterate through results, 10 at a time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Avian community structure and incidence of human West Nile infection\n",
      "Rat12_a\n",
      "Non compact continuum limit of two coupled Potts models\n",
      "\n",
      "Simultaneous Localization, Mapping, and Manipulation for Unsupervised\n",
      "  Object Discovery\n",
      "Synthesis of High-Temperature Self-lubricating Wear Resistant Composite Coating on Ti6Al4V Alloy by Laser Deposition\n",
      "Comparative Studies of Silicon Dissolution in Molten Aluminum Under Different Flow Conditions, Part I: Single-Phase Flow\n",
      "Scrambling of data in all-optical domain\n",
      "Step behaviour and autonomic nervous system activity in multiparous dairy cows during milking in a herringbone milking system\n",
      "<p>Typical features of the constant velocity forced dissociation process in the SGP-3-ligated 1G1Q 2CR complex system.</p>\n"
     ]
    }
   ],
   "source": [
    "results = basic_search.execute()\n",
    "\n",
    "for hit in results:\n",
    "    print(hit.title)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we don't want 10 results, or we want to offset the results, we can use slices"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Elements of Trust in Named-Data Networking\n",
      "Effect of Perceived Attributions about Ostracism on Social Pain and Task Performance\n",
      "Millimeter Wave MIMO Channel Tracking Systems\n",
      "Metric Dimension and Zero Forcing Number of Two Families of Line Graphs\n",
      "The Glassey conjecture on asymptotically flat manifolds\n"
     ]
    }
   ],
   "source": [
    "results = basic_search[20:25].execute()\n",
    "for hit in results:\n",
    "    print(hit.title)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Advanced Search with sharepa\n",
    "\n",
    "You can make your own search object, which allows you to pass in custom queries for certain terms or SHARE fields. Queries are formed using [lucene query syntax](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax), just like we used in the above examples.\n",
    "\n",
    "This type of query accepts a 'query_string'. Other options include a match query, a multi-match query, a bool query, and any other query structure available in the elasticsearch API.\n",
    "\n",
    "We can see what that query that we're about to send to elasticsearch by using the pretty print helper function. You'll see that it looks very similar to the queries we defined by hand earlier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "    \"query\": {\n",
      "        \"query_string\": {\n",
      "            \"analyze_wildcard\": true, \n",
      "            \"query\": \"NOT tags:*\"\n",
      "        }\n",
      "    }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "from sharepa import ShareSearch\n",
    "from sharepa.helpers import pretty_print\n",
    "\n",
    "my_search = ShareSearch()\n",
    "\n",
    "my_search = my_search.query(\n",
    "    'query_string', # Type of query, will accept a lucene query string\n",
    "    query='NOT tags:*', # This lucene query string will find all documents that don't have tags\n",
    "    analyze_wildcard=True  # This will make elasticsearch pay attention to the asterisk (which matches anything)\n",
    ")\n",
    "\n",
    "pretty_print(my_search.to_dict())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When you execute that query, you can then iterate through the results the same way that you could with the simple search query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Avian community structure and incidence of human West Nile infection\n",
      "Non compact continuum limit of two coupled Potts models\n",
      "\n",
      "Simultaneous Localization, Mapping, and Manipulation for Unsupervised\n",
      "  Object Discovery\n",
      "Synthesis of High-Temperature Self-lubricating Wear Resistant Composite Coating on Ti6Al4V Alloy by Laser Deposition\n",
      "Comparative Studies of Silicon Dissolution in Molten Aluminum Under Different Flow Conditions, Part I: Single-Phase Flow\n",
      "Scrambling of data in all-optical domain\n",
      "Step behaviour and autonomic nervous system activity in multiparous dairy cows during milking in a herringbone milking system\n",
      "<p>Typical features of the constant velocity forced dissociation process in the SGP-3-ligated 1G1Q 2CR complex system.</p>\n",
      "The elusive shepherdess\n"
     ]
    }
   ],
   "source": [
    "new_results = my_search.execute()\n",
    "for hit in new_results:\n",
    "    print(hit.title)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Debugging and Problem Solving\n",
    "\n",
    "Not everything always goes as planned when querying an unfamillar API. Here are some debugging and problem solving strategies when you're querying the SHARE API.\n",
    "\n",
    "### Schema issues\n",
    "The SHARE schema has a lot of parts, and much of the information is nested within sections. Making a query isn't always as straight forward as you might think, if you're not looking in the right part of the schema.\n",
    "\n",
    "Let's say you were trying to query for all SHARE documents that specify the language as not being in English.\n",
    "\n",
    "We'll guess as to what that query might be, and try to make it using sharepa."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "language_search = ShareSearch()\n",
    "\n",
    "language_search = language_search.query(\n",
    "    'query_string', # Type of query, will accept a lucene query string\n",
    "    query='NOT languages=english', # This lucene query string will find all documents that don't have tags\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "ename": "AttributeError",
     "evalue": "'Result' object has no attribute 'languages'",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-19-f7e8b99a9f64>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      2\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      3\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mhit\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mresults\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 4\u001b[0;31m     \u001b[0;32mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mhit\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlanguages\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;32m/Users/erin/.virtualenvs/tuts/lib/python2.7/site-packages/elasticsearch_dsl/utils.pyc\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, attr_name)\u001b[0m\n\u001b[1;32m    107\u001b[0m         \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    108\u001b[0m             raise AttributeError(\n\u001b[0;32m--> 109\u001b[0;31m                 '%r object has no attribute %r' % (self.__class__.__name__, attr_name))\n\u001b[0m\u001b[1;32m    110\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    111\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mAttributeError\u001b[0m: 'Result' object has no attribute 'languages'"
     ]
    }
   ],
   "source": [
    "results = language_search.execute()\n",
    "\n",
    "for hit in results:\n",
    "    print(hit.languages)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So the result does not have an attribute called languages! Let's try to figure out what went wrong here.\n",
    "\n",
    "Step one could be that we are trying to find something that does NOT match a given parameter. Since languages is not required, this is returning results that do not include the languages result at all!\n",
    "\n",
    "So let's fix this up a bit to make sure that we're querying for items that specify language in the first place."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "language_search = ShareSearch()\n",
    "\n",
    "language_search = language_search.filter(\n",
    "    'exists',\n",
    "    field=\"languages\"\n",
    ")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are 155407 documents with languages specified\n",
      "Here are the languages for the first 10 results:\n",
      "[u'ger']\n",
      "[u'fre']\n",
      "[u'eng']\n",
      "[u'eng']\n",
      "[u'eng']\n",
      "[u'eng']\n",
      "[u'eng']\n",
      "[u'eng', u'fre']\n",
      "[u'eng']\n",
      "[u'eng']\n"
     ]
    }
   ],
   "source": [
    "results = language_search.execute()\n",
    "\n",
    "# Let's see how many documents have language results.\n",
    "print('There are {} documents with languages specified'.format(language_search.count()))\n",
    "\n",
    "print('Here are the languages for the first 10 results:')\n",
    "\n",
    "# Check out the first few results\n",
    "for hit in results:\n",
    "    print(hit.languages)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So now we're better equipped to add on to this filter, and then narrow down to results that are not in English.\n",
    "\n",
    "When we printed out the first few results, we might have noticed a second problem with our query -- going back to the [SHARE Schema](https://github.com/CenterForOpenScience/SHARE-Schema/blob/master/share.yaml), we might notice that there is a restriction on how languages are captured - as a three letter lowercase representation. Instead of \"english\" let's look for the three letter abbreviation - \"eng\"\n",
    "\n",
    "We can modify our new and improved language query by adding on another query to our started language_search. We'll use the elasticsearch query object Q, and invert it with a ~ symbol, and search for the term \"eng.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are 4007 documents that do not have \"eng\" listed.\n",
      "Here are the languages for the first 10 results:\n",
      "[u'ger']\n",
      "[u'fre']\n",
      "[u'ger']\n",
      "[u'fre']\n",
      "[u'lat']\n",
      "[u'lat']\n",
      "[u'lat']\n",
      "[u'lat']\n",
      "[u'fre']\n",
      "[u'tha']\n"
     ]
    }
   ],
   "source": [
    "from elasticsearch_dsl import Q\n",
    "\n",
    "language_search = language_search.query(~Q(\"term\", languages=\"eng\"))\n",
    "\n",
    "results = language_search.execute()\n",
    "\n",
    "# Let's see how many documents have language results that aren't eng\n",
    "print('There are {} documents that do not have \"eng\" listed.'.format(language_search.count()))\n",
    "\n",
    "print('Here are the languages for the first 10 results:')\n",
    "\n",
    "# Check out the first few results, make sure \"eng\" isn't in there\n",
    "for hit in results:\n",
    "    print(hit.languages)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
